dplyr Package Tutorial
In this tutorial, we will learn about the dplyr package
and how to apply it to the data we want to run a statistical analysis
on.
Gain a deeper understanding of dplyr
Apply the dplyr functions you learn to any set of data you are working with
Make inferences using ANOVA test results
The dplyr package is commonly used to manipulate data for
analysis. Given that the phrase attached to this package is “A Grammar of
Data Manipulation,” it is an analyst's most valuable tool when processing
data in R (Wickham et al.).
There are over 100 functions available, each aiding in the reconstruction of the data frame, but we will only be using the following to analyze our set of data:
filter(): select specific rows
mutate(): create new columns or modify existing ones within the same data frame
rename(): change the names of columns
group_by(): group data based on a specified variable
summarize(): usually paired with a function like `group_by()`; helps compute
calculated values for certain groups
glimpse(): shows a brief summary of a dataset; for example, it displays the
structure of a column, such as whether it is a character or a double
arrange(): reorder rows based on a specified variable
desc(): paired with the `arrange()` function; displays a column in descending order
We will need a couple of additional functions to help us
manipulate the data frame provided. The separate() function
from the tidyr package will be useful later on: it splits a
column into several by a specified delimiter.
The piping operator %>% is part of the
magrittr package and will be essential throughout the
cleaning process. Instead of nesting calls in long lines of code with
parentheses that are easy to leave unclosed, the pipe passes each result
on to the next function and simplifies our code.
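As a quick illustration (a toy vector, not part of our dataset), here is the same computation written with nested calls and with the pipe:

```r
library(magrittr)

# Nested calls read inside-out, and it is easy to drop a parenthesis:
round(mean(c(1.25, 3.5, 5.25)), 1)  # 3.3

# The pipe reads top to bottom, handing each result to the next function:
c(1.25, 3.5, 5.25) %>%
  mean() %>%
  round(1)                          # 3.3
```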
We will be using the following sample data to practice the new functions we have learned.
# Creating sample data
student_data <- data.frame(
  StudentID = c(1, 2, 3, 4, 5),
  Full_Name = c("Doe, John", "Smith, Alison", "Johnson, Robert", "Brown, Evelyn", "Smith, Jamie"),
  Math_Score = c(85, 72, 90, 65, 78),
  Science_Score = c(92, 88, 75, 80, 85),
  History_Score = c(78, 85, 70, 60, 92)
)
Now it is time to apply the functions. We will need to load the
packages. If you have not installed them, do so now using
install.packages("name_of_package").
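The chunk that loaded the packages is not shown, so the calls below are a reasonable reconstruction; the attach messages that follow come from loading dplyr, the tidyverse, and magrittr in that order.

```r
# Presumed loading step (reconstructed; the original chunk was hidden)
library(dplyr)     # masks stats::filter, stats::lag, and some base functions
library(tidyverse) # attaches ggplot2, tidyr, readr, purrr, tibble, stringr, forcats, lubridate
library(magrittr)  # masks purrr::set_names and tidyr::extract
```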
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.4 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'magrittr'
##
##
## The following object is masked from 'package:purrr':
##
## set_names
##
##
## The following object is masked from 'package:tidyr':
##
## extract
Using the piping operator %>%, we can chain all of
these functions together:
analyzed_data <- student_data %>%
  select(Full_Name, Math_Score, Science_Score, History_Score) %>% # did not want StudentID included
  rename(Math = Math_Score, Science = Science_Score, History = History_Score) %>% # new name on the left of '=', old name on the right
  mutate(total = Math + Science + History) %>% # using the new names now
  separate(Full_Name, into = c("Last_name", "First_name"), sep = ",") %>% # first name the variable to split, then the columns it separates into
  group_by(Last_name) %>%
  summarize(avg_math = mean(Math)) %>%
  arrange(desc(avg_math)) %>%
  filter(avg_math > 70) %>%
  glimpse()
## Rows: 3
## Columns: 2
## $ Last_name <chr> "Johnson", "Doe", "Smith"
## $ avg_math <dbl> 90, 85, 75
Now that you have gotten a taste of using these functions, it is time to apply them to a larger data frame. For this tutorial, we will clean and manipulate data from the U.S. Census Bureau website (Bureau).
The data provide a sample of 2022 payroll for state government positions. The following variables were provided:
State
Government position
Full-time Payroll
Count of Full-time Employees
Part-time Payroll
Count of Part-time Employees
Total Payroll
Before I clean the data and decide on a research question, it is
important to explore the data to see what variables I will be dealing
with. I will do this by using the following base R functions:
View(), nrow(), ncol(), and
length(names()). These functions will assist with my
exploration process and get me familiarized with the dataset I will be
working with.
#loading data in script
library(readxl)
salaries<- read_excel("2022_state.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
salaries
## # A tibble: 1,952 × 9
## STATE GOVERNMENT: EMPLOYMEN…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BY STATE AND BY FUNCTION: MA… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Preliminary Data Released: J… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 SOURCE: 2022 Census of Gover… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 7 https://www.census.gov/progr… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 8 their own estimates from the… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 9 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 10 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # ℹ 1,942 more rows
## # ℹ abbreviated name: ¹`STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA`
head(salaries) # looking at the first few rows
## # A tibble: 6 × 9
## STATE GOVERNMENT: EMPLOYMENT…¹ ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BY STATE AND BY FUNCTION: MAR… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 Preliminary Data Released: Ju… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 SOURCE: 2022 Census of Govern… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## # ℹ abbreviated name: ¹`STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA`
nrow(salaries)/50 # average number of rows (positions) per state across the 50 states
## [1] 39.04
ncol(salaries) #how many columns
## [1] 9
Note: once the data was imported from Excel, the column names did not transfer properly.
Cleanup plan:
rename the columns; the real header starts around row 15
the 1,952 rows are far too many for analysis; with about 39 rows per state, we will condense and/or omit variables
to simplify results, we will focus on the total payroll for all forms of employment, filtering to the necessary columns and rows
After exploring the data, I was able to develop my research question:
which state government pays the highest salary? As I stated in the previous section, it is important to have a plan before tackling your data. To find the highest salary, I will most likely use a graph for visualization and a table to look at the concrete numbers. After using visualization tools, I want to ensure that the conclusion is statistically significant, so I will run a
one-way ANOVA test. Alpha will equal 0.05; therefore, if the p-value is less than 0.05, the result is significant.
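The decision rule can be sketched directly in R; the p-value below is a placeholder for illustration, not a result from our data:

```r
# Hypothetical decision rule sketch
alpha <- 0.05
p_value <- 0.003  # placeholder value, not computed from our dataset
p_value < alpha   # TRUE, so we would call the result significant
```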
We will be calling the tidyverse package to utilize
features from both dplyr and ggplot2.
Now we will isolate the 1st, 2nd, 4th, and 6th columns and rows 15:1992,
then rename the columns. The select,
slice, and mutate functions from the
dplyr package will help us execute this task.
salaries %>%
  select("STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA", "...2", "...4", "...6") %>%
  slice(15:1992) %>%
  filter(...6 > 0) %>% # decided to filter out the positions that had a zero total
  mutate(full_time = "...4",
         part_time = "...6",
         states = "STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA",
         positions = "...2"
  ) %>%
  print()
## # A tibble: 1,106 × 8
## STATE GOVERNMENT: EM…¹ ...2 ...4 ...6 full_time part_time states positions
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 US Tota… 2438… 2441… ...4 ...6 STATE… ...2
## 2 US Fina… 9903… 1735… ...4 ...6 STATE… ...2
## 3 US Othe… 2929… 1770… ...4 ...6 STATE… ...2
## 4 US Judi… 1150… 2064… ...4 ...6 STATE… ...2
## 5 US Poli… 7606… 4653… ...4 ...6 STATE… ...2
## 6 US Poli… 5661… 6704… ...4 ...6 STATE… ...2
## 7 US Poli… 1944… 3982… ...4 ...6 STATE… ...2
## 8 US Corr… 2323… 2295… ...4 ...6 STATE… ...2
## 9 US High… 1149… 1861… ...4 ...6 STATE… ...2
## 10 US Air … 2546… 57589 ...4 ...6 STATE… ...2
## # ℹ 1,096 more rows
## # ℹ abbreviated name: ¹`STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA`
I ran into a problem when mutating the columns: because the column names were quoted on the right-hand side, mutate() assigned the literal string to every row, replacing the numeric values.
Let's try this again.
So rename() is what actually changes a column's name. Always remember
that mutate() manipulates the values in the rows of columns.
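A minimal toy example (assumed data, not from this tutorial's dataset) makes the difference concrete:

```r
library(dplyr)

toy <- data.frame(a = c(1, 2, 3))

# mutate() with a quoted name on the right-hand side assigns the literal
# string "a" to every row; this is the mistake made above
mutate(toy, b = "a")  # column b is "a", "a", "a"

# rename() is the verb that takes new_name = "old_name"
rename(toy, b = "a")  # column a is now named b; the values 1, 2, 3 are untouched
```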
payroll <- salaries %>%
select("STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA", "...2", "...4", "...6") %>%
slice(15:1992) %>%
rename(
full_time = "...4",
part_time = "...6",
states = "STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA",
positions = "...2"
) %>%
  filter(full_time != 0,
         !positions %in% c(
"Total - All Government Employment Functions",
"Police Protection Total",
"Fire Protection Total",
"Education Total",
"Education - Elementary and Secondary Total",
"Education - Higher Education Total"
)
) %>% #Also decided to filter out sub totals and 0s in each column
filter(part_time != 0,
!positions %in% c(
"Total - All Government Employment Functions",
"Police Protection Total",
"Fire Protection Total",
"Education Total",
"Education - Elementary and Secondary Total",
"Education - Higher Education Total"
)
)
print(payroll)
## # A tibble: 898 × 4
## states positions full_time part_time
## <chr> <chr> <chr> <chr>
## 1 US Financial Administration 990359836 17356080
## 2 US Other Government Administration 292909004 17702042
## 3 US Judicial and Legal 1150539087 20648400
## 4 US Police Protection - Persons with Power of Arrest 566174779 670411
## 5 US Police Protection - Other 194445002 3982950
## 6 US Corrections 2323903777 22956231
## 7 US Highways 1149897906 18612575
## 8 US Air Transportation 25469702 57589
## 9 US Sea and Inland Port Facilities 39896158 109433
## 10 US Public Welfare 1131087361 21508957
## # ℹ 888 more rows
After cleaning the data, now it’s time to draw some inferences.
join_salary <- payroll %>%
  group_by(states) %>%
  summarize(mean_salary = mean(n()))
print(join_salary)
## # A tibble: 51 × 2
## states mean_salary
## <chr> <dbl>
## 1 AK 19
## 2 AL 19
## 3 AR 17
## 4 AZ 17
## 5 CA 18
## 6 CO 15
## 7 CT 18
## 8 DE 17
## 9 FL 18
## 10 GA 16
## # ℹ 41 more rows
I had trouble finding the mean because the salaries are being treated as characters, not doubles:
glimpse(payroll) # full_time and part_time salaries are characters
## Rows: 898
## Columns: 4
## $ states <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", …
## $ positions <chr> "Financial Administration", "Other Government Administration…
## $ full_time <chr> "990359836", "292909004", "1150539087", "566174779", "194445…
## $ part_time <chr> "17356080", "17702042", "20648400", "670411", "3982950", "22…
join_salaries <- payroll %>%
  mutate(
    full_time = as.numeric(full_time),
    part_time = as.numeric(part_time),
    total_pay = full_time + part_time
  ) %>%
  group_by(states) %>%
  summarize(mean_salary = mean(total_pay))
head(join_salaries)
## # A tibble: 6 × 2
## states mean_salary
## <chr> <dbl>
## 1 AK 8079107.
## 2 AL 25724643.
## 3 AR 17215672.
## 4 AZ 24609665.
## 5 CA 213737235.
## 6 CO 36675423.
glimpse(join_salaries) # now full_time and part_time are doubles
## Rows: 51
## Columns: 2
## $ states <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA"…
## $ mean_salary <dbl> 8079107, 25724643, 17215672, 24609665, 213737235, 36675423…
Now it's time to find the mean and median of
total_pay. First, we add the full-time and part-time columns together.
# To keep the positions column, we recompute total_pay directly on the payroll dataset
complete_pay <- payroll %>%
mutate(
full_time = as.numeric(full_time),
part_time = as.numeric(part_time),
total_pay = full_time + part_time
)
position_calc_pay <- complete_pay %>%
  group_by(positions) %>%
  summarize(mean_pay = mean(total_pay, na.rm = TRUE), median_pay = median(total_pay)) %>%
  arrange(desc(mean_pay))
head(position_calc_pay)
## # A tibble: 6 × 3
## positions mean_pay median_pay
## <chr> <dbl> <dbl>
## 1 Education - Higher Education Other 240348688. 100699491
## 2 Education - Higher Education Instructional 216376762. 79736800
## 3 Hospitals 108920803. 32047634
## 4 Corrections 91765010. 24000538.
## 5 Transit 84731883. 44594895
## 6 Highways 50517611. 15599646
state_calc_pay <- complete_pay %>%
  group_by(states) %>%
  summarize(mean_pay = mean(total_pay, na.rm = TRUE), median_pay = median(total_pay)) %>%
  arrange(desc(mean_pay))
head(state_calc_pay)
## # A tibble: 6 × 3
## states mean_pay median_pay
## <chr> <dbl> <dbl>
## 1 US 924928786. 310611046
## 2 CA 213737235. 112976658
## 3 TX 99948947. 47047908
## 4 NY 82109160. 51256641
## 5 PA 55413687. 30355233
## 6 MI 54748541. 20324641
From the calculated
mean(), it appears that California has the highest mean salary out of all of the states.
As part of the tidyverse package, we will also be
utilizing ggplot2 to add visual aids to our
analysis.
library(tidyverse)
# Scatterplot of mean salaries by state
ggplot(join_salaries, aes(states, mean_salary)) +
  geom_point() + labs(title = "Mean Salaries of AK-NH", x = "States", y = "Mean") +
  coord_cartesian(xlim = c(0, 30)) # first 30 states
ggplot(join_salaries, aes(states, mean_salary)) +
  geom_point() + labs(title = "Mean Salaries of NH-WY", x = "States", y = "Mean") +
  coord_cartesian(xlim = c(30, 51)) # remaining states
# Box plots
p <- ggplot(data = complete_pay, mapping = aes(x = states, y = total_pay, color = states)) +
  geom_boxplot(show.legend = FALSE) +
  # California looks the highest
  labs(title = "Mean Salaries of Each State", x = "States", y = "Total Pay")
p
Since there are so many states, it is difficult for the end-user to
see all of the points. There is a feature we tried in the
scatterplot, coord_cartesian(xlim = c(0, 30)), that lets us
change the visible axis range, but as shown, we needed at least two
intervals. For a cleaner look, the plotly library
was a great addition to our presentation.
First, I assigned the plot a name. Then I called the
ggplotly() function to convert the ggplot2 object to a plotly one. This feature allows the end-user to interact with the data.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(p)
California can be expected to have the highest mean salary because it
may have the highest cost of living. Using the data from the
advisorsmith.com website, “City Cost of Living Index
Article,” we can examine this conclusion (Mak).
## # A tibble: 502 × 5
## Lat Lon City states Cost_of_Living_Index
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 37.1 -88.6 Paducah " KY" 85.7
## 2 33.5 -86.8 Birmingham " AL" 90.7
## 3 34.2 -86.8 Cullman " AL" 90.5
## 4 30.6 -87.9 Daphne " AL" 96.6
## 5 34.6 -87.0 Decatur " AL" 88.6
## 6 31.2 -85.4 Dothan " AL" 86.3
## 7 34.8 -87.7 Florence " AL" 86.8
## 8 34.0 -86.0 Gadsden " AL" 86
## 9 34.7 -86.6 Huntsville " AL" 91.3
## 10 30.7 -88.0 Mobile " AL" 88.3
## # ℹ 492 more rows
var_data %>%
  group_by(states) %>%
  plot_geo(x = ~Lon, y = ~Lat) %>%
  add_markers(size = ~Cost_of_Living_Index, text = ~Cost_of_Living_Index) %>% # marker size is determined by the index value
  layout(title = "Cost of Living Index by US City")
## Warning: `line.width` does not currently support multiple values.
mean_cost <- var_data %>%
  group_by(states) %>%
  summarize(mean_cost_of_living = mean(Cost_of_Living_Index)) %>%
  arrange(desc(mean_cost_of_living)) %>%
  print()
## # A tibble: 51 × 2
## states mean_cost_of_living
## <chr> <dbl>
## 1 " HI" 151.
## 2 " CA" 126.
## 3 " DC" 120.
## 4 " MA" 115.
## 5 " VT" 113.
## 6 " RI" 112.
## 7 " AK" 109.
## 8 " CT" 109.
## 9 " CO" 108.
## 10 " WA" 107.
## # ℹ 41 more rows
From this table, we can see that California has the second-highest cost of living. This helps explain why its mean_salary is expected to be so high.
plot <- ggplot(var_data, aes(y = Cost_of_Living_Index, x = states, color = states)) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle("Cost_of_Living_Index in Each State")
library(plotly)
ggplotly(plot)
To confirm that our descriptive statistics have statistical significance, we will need to run tests. The common tests that can be run in
R are: t-test, linear regression, chi-squared test, and ANOVA.
We will be running a two-way ANOVA test without
interaction between the variables. This is an appropriate test for our
data because we want to know the relationship between the dependent
variable (total_pay) and two categorical independent variables (states
and positions). Once we run the ANOVA test, we will want to look at the
p-value. If it is under .05, then we claim that the state
and position group means are different. This indicates that where the
position is located and the title of the job have an impact on the
salary.
# Two-way ANOVA test
sal_aov <- aov(total_pay ~ states + positions, data = complete_pay)
summary(sal_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## states 50 2.344e+19 4.687e+17 6.250 < 2e-16 ***
## positions 28 6.411e+18 2.290e+17 3.053 2.73e-07 ***
## Residuals 819 6.142e+19 7.499e+16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Two-way ANOVA test without the US row
w_o_US_pay <- payroll %>%
  mutate(
    full_time = as.numeric(full_time),
    part_time = as.numeric(part_time),
    total_pay = full_time + part_time
  ) %>%
  filter(states != "US")
w_o_US_aov <- aov(total_pay ~ states + positions, data = w_o_US_pay)
summary(w_o_US_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## states 49 9.645e+17 1.968e+16 9.46 <2e-16 ***
## positions 28 1.032e+18 3.687e+16 17.72 <2e-16 ***
## Residuals 791 1.646e+18 2.081e+15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Both p-values are under .05, which means we can reject the null hypothesis: states and positions have an impact on an individual's payroll in the year 2022. Another thing to pay attention to is that the US row (the sum across all states) was included in the first two-way test that was run. The residual sum of squares decreased once the US was removed. This makes sense because there is a lot of variation across the states, so that aggregate row would mask an individual state's influence on salary.
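Rather than reading the p-values off the printed table by eye, they can be extracted programmatically from the summary. The sketch below uses R's built-in npk dataset so it runs on its own; the same pattern applies to the sal_aov model above.

```r
# Extracting p-values from an aov summary (illustrated on the built-in
# npk dataset; substitute sal_aov to apply this to our model)
fit <- aov(yield ~ block + N, data = npk)
tab <- summary(fit)[[1]]    # a data frame with Df, Sum Sq, ..., "Pr(>F)"
pvals <- tab[["Pr(>F)"]]    # one entry per term; the Residuals row is NA
na.omit(pvals) < 0.05       # logical vector: which terms are significant
```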
From this tutorial, you can see how dplyr plays an extensive role in
data analytics. As shown earlier, it would be difficult to catch
underlying problems without using our glimpse() function.
Additionally, we get a lot of information from our data by grouping it
with the group_by() and summarize()
functions. Before using any data visualization, we knew that California
was going to have one of the highest means in the group, and we
discovered later on that it was because of its cost of living. What
makes things more intriguing is that Hawaii actually had the highest cost
of living, and California came second. That came as a surprise, because
Hawaii was not even in the top five mean salaries. We could explore
more data to see what jobs are more prevalent and/or lucrative in the
state of Hawaii. If we refer back to our ANOVA test, location and
position tend to influence pay.
To dive deeper into the topics mentioned, please refer to these resources:
plotly package: https://plotly.com/r/
dplyr package: https://dplyr.tidyverse.org/
ANOVA: https://researchmethod.net/anova/
ggplot2 package: https://ggplot2.tidyverse.org/
For reference, the dplyr function index is summarized below.
Verbs that principally operate on rows:
arrange(): order rows using column values
distinct(): keep distinct/unique rows
filter(): keep rows that match a condition
slice(), slice_head(), slice_tail(), slice_min(), slice_max(), slice_sample(): subset rows using their positions
Verbs that principally operate on columns:
glimpse(): get a glimpse of your data
mutate(): create, modify, and delete columns
pull(): extract a single column
relocate(): change column order
rename(), rename_with(): rename columns
select(): keep or drop columns using their names and types
Groups: verbs that principally operate on groups of rows:
count(), tally(), add_count(), add_tally(): count the observations in each group
group_by(), ungroup(): group by one or more variables
dplyr_by: per-operation grouping with .by/by
rowwise(): group input by rows
summarise(), summarize(): summarise each group down to one row
reframe(): transform each group to an arbitrary number of rows
n(), cur_group(), cur_group_id(), cur_group_rows(), cur_column(): information about the “current” group or variable
Verbs that principally operate on pairs of data frames:
bind_cols(): bind multiple data frames by column
bind_rows(): bind multiple data frames by row
intersect(), union(), union_all(), setdiff(), setequal(), symdiff(): set operations
inner_join(), left_join(), right_join(), full_join(): mutating joins
nest_join(): nest join
semi_join(), anti_join(): filtering joins
cross_join(): cross join
join_by(): join specifications
rows_insert(), rows_append(), rows_update(), rows_patch(), rows_upsert(), rows_delete(): manipulate individual rows
Pair these functions with mutate(), summarise(), filter(), and group_by() to operate on multiple columns simultaneously:
across(), if_any(), if_all(): apply a function (or functions) across multiple columns
c_across(): combine values from multiple columns
pick(): select a subset of columns
Unlike other dplyr functions, these functions work on individual vectors, not data frames:
between(): detect where values fall in a specified range
case_match(): a general vectorised switch()
case_when(): a general vectorised if-else
coalesce(): find the first non-missing element
consecutive_id(): generate a unique identifier for consecutive combinations
cumall(), cumany(), cummean(): cumulative versions of any, all, and mean
desc(): descending order
if_else(): vectorised if-else
lag(), lead(): compute lagged or leading values
n_distinct(): count unique combinations
na_if(): convert values to NA
near(): compare two numeric vectors
nth(), first(), last(): extract the first, last, or nth value from a vector
ntile(): bucket a numeric vector into n groups
order_by(): a helper function for ordering window function output
percent_rank(), cume_dist(): proportional ranking functions
recode(), recode_factor(): recode values
row_number(), min_rank(), dense_rank(): integer ranking functions
Built-in datasets:
band_members, band_instruments, band_instruments2: band membership
starwars: Starwars characters
storms: storm tracks data
This (mostly) experimental family of functions is used to manipulate groups in various ways:
group_cols(): select grouping variables
group_map(), group_modify(), group_walk(): apply a function to each group
group_trim(): trim grouping structure
Superseded functions have been replaced by new approaches that we believe to be superior, but we don’t want to force you to change until you’re ready, so the existing functions will stay around for several years:
sample_n(), sample_frac(): sample n rows from a table
top_n(), top_frac(): select top (or bottom) n rows (by value)
scoped: operate on a selection of variables
all_vars(), any_vars(): apply a predicate to all variables
vars(): select variables
with_groups(): perform an operation with temporary groups
Verbs for remote tables (databases):
auto_copy(): copy tables to the same source, if necessary
compute(), collect(), collapse(): force computation of a database query
copy_to(): copy a local data frame to a remote src
ident(): flag a character vector as SQL identifiers
explain(), show_query(): explain details of a tbl
tbl(), is.tbl(): create a table from a data source
sql(): SQL escaping
Bureau, US Census. “2022 Census of Governments, Survey of Public Employment & Payroll Datasets & Tables.” Census.gov, www.census.gov/data/datasets/2022/econ/apes/2022.html. Accessed 7 Dec. 2023.
Dwi Cahyaningtyas, Rany. “RPubs - R Markdown Themes Guide: Unleash Your Creativity with Captivating Document Styles.” Rpubs.com, 28 May 2023, rpubs.com/ranydc/rmarkdown_themes. Accessed 7 Dec. 2023.
Mak, Adrian. “U.S. Cost of Living Index by City: Downloadable Data – AdvisorSmith.” Advisorsmith.com, 2021, advisorsmith.com/data/coli/.
Sievert, Carson. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC, 2020, plotly-r.com.
Wickham, Hadley. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016, ggplot2.tidyverse.org.
Wickham, Hadley, et al. Dplyr: A Grammar of Data Manipulation. 2023, CRAN.R-project.org/package=dplyr.
https://dplyr.tidyverse.org/reference/index.html#vector-functions